Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. Over the last decade, it has attracted significantly increased interest in the computer vision community and has been used in a broad range of practical applications, from security monitoring to social media to visual special effects, to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework that provides a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page to continuously track recent developments in this fast-advancing field.
The processing and recognition of geoscience images have wide applications. Most existing research focuses on understanding high-quality geoscience images and assumes that all images are clear. However, in many real-world cases, geoscience images contain occlusions introduced during image acquisition. This problem corresponds to the image inpainting problem in computer vision and multimedia. To the best of our knowledge, all existing image inpainting algorithms learn to repair occluded regions for better visual quality. They work well for natural images but are not good enough for geoscience images because they ignore the geoscience-related tasks. This paper aims to repair occluded regions so as to improve geoscience task performance and visual quality simultaneously, without changing the currently deployed deep learning-based geoscience models. Because of the complex context of geoscience images, we propose a coarse-to-fine encoder-decoder network with coarse-to-fine adversarial context discriminators to reconstruct the occluded image regions. Because geoscience image data are limited, we use a MaskMix-based data augmentation method to exploit more information from the available data. Experimental results on three public geoscience datasets, for remote sensing scene recognition, cross-view geolocation, and semantic segmentation respectively, show the effectiveness and accuracy of the proposed method.
Reinforcement Learning (RL) is a popular machine learning paradigm in which intelligent agents interact with the environment to fulfill a long-term goal. Driven by the resurgence of deep learning, Deep RL (DRL) has achieved great success over a wide spectrum of complex control tasks. Despite these encouraging results, the deep neural network-based backbone is widely regarded as a black box, which prevents practitioners from trusting and employing trained agents in realistic scenarios where high security and reliability are essential. To alleviate this issue, a large body of literature has been devoted to shedding light on the inner workings of intelligent agents, by constructing intrinsic interpretability or post-hoc explainability. In this survey, we provide a comprehensive review of existing work on eXplainable RL (XRL) and introduce a new taxonomy in which prior work is clearly categorized into model-explaining, reward-explaining, state-explaining, and task-explaining methods. We also review and highlight RL methods that conversely leverage human knowledge to improve the learning efficiency and performance of agents, a kind of method that is often ignored in the XRL field. We discuss some challenges and opportunities in XRL. This survey intends to provide a high-level summary of XRL and to motivate future research on more effective XRL solutions. We also collect and categorize the corresponding open-source code.
Dense pose estimation is a dense 3D prediction task for instance-level human analysis, aiming to map human pixels from an RGB image to a 3D surface of the human body. Because a large number of surface points must be regressed, training collapses more easily than in other region-based human instance analysis tasks. By analyzing the loss formulation of the existing dense pose estimation model, we introduce a novel point regression loss function, named Dense Points loss, to stabilize the training process, and a new balanced loss weighting strategy to handle the multi-task losses. Building on these contributions, we propose a new architecture named UV R-CNN. Without auxiliary supervision or external knowledge from other tasks, UV R-CNN can handle many complicated issues in dense pose model training, achieving 65.0% $AP_{gps}$ and 66.1% $AP_{gpsm}$ on the DensePose-COCO validation subset with a ResNet-50-FPN feature extractor, which is competitive among state-of-the-art dense human pose estimation methods.
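As methods have evolved, inversion has come to be divided into two main steps. The first step is image embedding, in which an encoder or an optimization process embeds an image to obtain the corresponding latent code. The second step aims to refine the inversion and editing results, and we call it result refinement. Although the second step significantly improves fidelity, perception and editability remain almost unchanged and depend heavily on the inverse latent codes obtained in the first step. A crucial problem is therefore to obtain latent codes with better perception and editability while retaining reconstruction fidelity. In this work, we first point out that these two characteristics are related to the degree of alignment (or misalignment) of the inverse codes with the synthetic distribution. We then propose the Latent Space Alignment Inversion Paradigm (LSAP), which consists of an evaluation metric and a solution. Specifically, we introduce the Normalized Style Space ($\mathcal{S^N}$ space) and the $\mathcal{S^N}$ Cosine Distance (SNCD) to measure the misalignment of inversion methods. Because the proposed SNCD is differentiable, it can be optimized in both encoder-based and optimization-based embedding methods, which yields a uniform solution. Extensive experiments in various domains show that SNCD effectively reflects perception and editability, and that our alignment paradigm achieves state-of-the-art results in both steps. The code is publicly available.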
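As a rough illustration of a cosine distance computed in a normalized code space, the sketch below z-scores each channel with statistics estimated from sampled codes and then takes one minus the cosine similarity between two normalized codes. The abstract does not specify the normalization or the reference code, so the per-channel z-scoring, the choice of reference, and the names `channel_stats`, `normalize`, and `cosine_distance` are all assumptions made for this example, not the authors' definition of SNCD.

```python
import math
from statistics import fmean, pstdev


def channel_stats(samples):
    """Per-channel mean and standard deviation over equal-length codes.

    Assumption: the normalized space is defined by z-scoring each channel
    with statistics taken from codes sampled from the generator.
    """
    channels = list(zip(*samples))
    means = [fmean(c) for c in channels]
    # Fall back to 1.0 when a channel has zero spread, to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in channels]
    return means, stds


def normalize(code, means, stds):
    """Map a code into the (assumed) normalized space."""
    return [(x - m) / s for x, m, s in zip(code, means, stds)]


def cosine_distance(a, b):
    """Return 1 - cos(a, b); 0 means aligned, 2 means opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

Every operation here is differentiable almost everywhere, which is the property the abstract relies on when it uses the distance as an optimization objective. An autodiff framework would be needed to use it as a loss in practice.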
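Deliberation is a common and natural behavior in daily human life. For example, when writing papers or articles, we usually write a draft first and then polish it iteratively until we are satisfied. Inspired by this cognitive process, we propose DECOM, a multi-pass deliberation framework for automatic comment generation. DECOM consists of multiple deliberation models and one evaluation model. Given a code snippet, we first extract keywords from the code and retrieve a similar code fragment from a predefined corpus. We then treat the comment of the retrieved code as the initial draft and feed it, together with the code and the keywords, into DECOM to start the iterative deliberation process. In each pass, a deliberation model polishes the draft and generates a new comment. The evaluation model measures the quality of the newly generated comment to decide whether to end the iterative process. Once the process terminates, the best generated comment is selected as the target comment. We evaluate our approach on two real-world datasets, in Java (87K) and Python (108K), and the experimental results show that it outperforms state-of-the-art baselines. A human evaluation study also confirms that the comments generated by DECOM tend to be more readable, informative, and useful.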
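The iterative loop described above can be sketched as follows. This is a minimal illustration, not the DECOM implementation: the callables `deliberate` and `evaluate`, the stopping rule (stop when a pass no longer improves the score), and `max_passes` are assumptions made for the example, and the two toy functions are hypothetical stand-ins for the learned deliberation and evaluation models.

```python
def deliberate_comment(code, keywords, initial_draft, deliberate, evaluate, max_passes=3):
    """Polish a draft comment over several passes and return the best one.

    deliberate(code, keywords, draft) -> new comment   (assumed interface)
    evaluate(code, comment) -> quality score            (assumed interface)
    """
    best_comment = initial_draft
    best_score = evaluate(code, initial_draft)
    draft = initial_draft
    for _ in range(max_passes):
        candidate = deliberate(code, keywords, draft)
        score = evaluate(code, candidate)
        if score <= best_score:
            # Assumed stopping rule: end when a pass brings no improvement.
            break
        best_comment, best_score = candidate, score
        draft = candidate
    return best_comment


# Hypothetical toy models, used only to exercise the loop.
def toy_deliberate(code, keywords, draft):
    """Append the first keyword that the draft does not mention yet."""
    for keyword in keywords:
        if keyword not in draft.split():
            return draft + " " + keyword
    return draft


def toy_evaluate(code, comment):
    """Score a comment by how many reference words it contains."""
    return sum(1 for word in ("sum", "list") if word in comment.split())


result = deliberate_comment(
    "def f(xs): return sum(xs)", ["sum", "list"], "returns value",
    toy_deliberate, toy_evaluate,
)
```

Keeping the best-scoring comment separately from the current draft mirrors the abstract's final selection step, where the best generated comment is chosen after the iteration ends.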
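Human matting refers to extracting human parts from natural images with high quality, including fine details such as hair, glasses, and hats. This technique plays an essential role in image synthesis and visual effects in the film industry. When a green screen is not available, existing human matting methods require either additional inputs (such as a trimap or a background image) or models with high computational cost and complex network structures, which makes human matting difficult to apply in practice. To alleviate these problems, most existing methods (such as MODNet) use multiple branches to pave the way for matting through segmentation, but these methods do not make full use of the image features and use only the network's prediction results as guidance information. We therefore propose a module that generates a foreground probability map and add it to MODNet to obtain the Semantic Guided Matting Net (SGM-Net). Our method performs human matting with only a single image as input. We validate our method on the P3M-10k dataset. Compared with the baseline, our method improves significantly on various evaluation metrics.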
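The electromagnetic detection satellite scheduling problem (EDSSP) has attracted attention because of the need to detect a large number of targets. This paper proposes a mixed-integer programming model for the EDSSP and an evolutionary algorithm framework based on reinforcement learning (RL-EA). The model considers numerous factors that affect electromagnetic detection, such as detection mode and bandwidth. The RL-EA framework uses Q-learning, and each individual in the population is regarded as an agent. Based on the proposed framework, we design a Q-learning-based genetic algorithm (QGA). Q-learning guides the population search process by choosing variation operators. In the algorithm, we design a reward function to update the Q-values. Based on the characteristics of the problem, we propose a new combination of state and action. QGA also uses an elite individual retention strategy to improve search performance. In addition, a task time window selection algorithm is proposed to evaluate the performance of population evolution. Experiments at various scales are used to examine the planning performance of the proposed algorithm. Experimental validation on multiple instances shows that QGA solves the EDSSP effectively. Compared with state-of-the-art algorithms, QGA performs better in several respects.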
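The idea of letting Q-learning choose the variation operator can be sketched with a standard tabular Q-learning update and epsilon-greedy selection. The abstract does not give the state encoding, the reward function, or the hyperparameters, so the table layout (one list of operator values per state), `alpha`, `gamma`, `epsilon`, and the function names below are assumptions for illustration only.

```python
import random


def select_operator(q_table, state, epsilon, rng=random):
    """Pick a variation operator index for `state` with epsilon-greedy selection."""
    values = q_table[state]
    if rng.random() < epsilon:
        # Explore: try a random operator.
        return rng.randrange(len(values))
    # Exploit: use the operator with the highest Q-value (first one on ties).
    return max(range(len(values)), key=values.__getitem__)


def update_q(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    """
    target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])


# One hypothetical state with two candidate variation operators.
q_table = {0: [0.0, 0.0]}
update_q(q_table, state=0, action=1, reward=1.0, next_state=0)
update_q(q_table, state=0, action=1, reward=1.0, next_state=0)
chosen = select_operator(q_table, state=0, epsilon=0.0)
```

In a genetic algorithm, the reward would typically come from the change in fitness after applying the chosen operator to an individual. That coupling is problem-specific and is left out of this sketch.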
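Language models demonstrate both quantitative improvements and new qualitative capabilities as their scale increases. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. To inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors from 132 institutions. Task topics are diverse, drawing on linguistics, childhood development, mathematics, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to billions of parameters. In addition, a team of human expert raters performed all tasks to provide a strong baseline. Findings include: model performance and calibration both improve with scale but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.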
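Deraining is an important and fundamental computer vision task that aims to remove rain streaks and rain accumulation from images or videos captured on rainy days. Existing deraining methods usually make heuristic assumptions about the rain model, which forces them to use complex optimization or iterative refinement to achieve high recovery quality. This, however, makes the methods time-consuming and reduces their effectiveness on rain patterns that deviate from the assumptions. In this paper, we propose a simple yet efficient deraining method by formulating deraining as a predictive filtering problem, without complex rain model assumptions. Specifically, we identify spatially-variant predictive filtering (SPFilt), which uses a deep network to adaptively predict a proper kernel for filtering each individual pixel. Because the filtering can be implemented with well-accelerated convolution, our method can be highly efficient. We further propose EfDeRain+, which contains three main contributions that address residual rain traces, multi-scale rain streaks, and diverse rain patterns without harming efficiency. First, we propose uncertainty-aware cascaded predictive filtering (UC-PFilt), which can identify how difficult it is to reconstruct clean pixels with the predicted kernels and effectively removes residual rain traces. Second, we design weight-sharing multi-scale dilated filtering (WS-MS-DFilt) to handle multi-scale rain streaks without harming efficiency. Third, to close the gap across diverse rain patterns, we propose a novel data augmentation method (RainMix) to train our deep models. By combining all contributions, with a thorough analysis of different variants, our final method outperforms baseline methods on four single-image deraining datasets and one video deraining dataset in terms of both recovery quality and speed.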
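To make the notion of spatially-variant filtering concrete, the sketch below applies a different kernel at every pixel of a single-channel image. It is an explicit, unoptimized loop rather than the accelerated convolution mentioned in the abstract, and the kernels are passed in directly instead of being predicted by a network. The function name `apply_pixelwise_kernels`, the nested-list data layout, and the zero padding at the borders are assumptions for this example.

```python
def apply_pixelwise_kernels(image, kernels):
    """Filter each pixel of `image` with its own K x K kernel.

    image:   H x W nested list of numbers (single channel).
    kernels: H x W nested list where kernels[y][x] is the K x K kernel
             for pixel (y, x); K is assumed to be odd.
    Out-of-bounds neighbours are treated as zero (assumed padding scheme).
    """
    height, width = len(image), len(image[0])
    radius = len(kernels[0][0]) // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            kernel = kernels[y][x]
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[dy + radius][dx + radius] * image[yy][xx]
            output[y][x] = total
    return output
```

A kernel with a single 1 at its center leaves that pixel unchanged, while an all-ones kernel sums the pixel's neighbourhood, so mixing the two across positions shows how the filter can vary from pixel to pixel.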
